Abstract

We present a derivation of the Kullback-Leibler (KL) divergence (also
known as relative entropy) for the von Mises-Fisher (VMF) distribution
in $d$ dimensions.


1 Introduction 

The von Mises-Fisher (VMF) distribution (also known as the Langevin
distribution [?]) is a probability distribution on the $(d-1)$-dimensional hypersphere
$S^{d-1}$ in $\mathbb{R}^d$ [3]. If $d = 2$ the distribution reduces to the von Mises distribution
on the circle, and if $d = 3$ it reduces to the Fisher distribution on a sphere.
It was introduced by [?] and has been studied extensively by [?, 7]. The first
Bayesian analysis was in [5] and recently it has been used for clustering on a
hypersphere by [2].



Figure 1: Three sets of 1000 points sampled from three VMF distributions on
the 3D sphere with $\kappa = 1$ (blue), $\kappa = 10$ (green) and $\kappa = 100$ (red), respectively.
The mean directions are indicated with arrows.
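
Point sets like those in Figure 1 can be generated with a few lines of code. The following is a minimal sketch for the case $d = 3$ only, where the cosine $w = \mu'\mathbf{x}$ has density proportional to $e^{\kappa w}$ on $[-1, 1]$ and can be drawn by inverting its CDF. The function names and the choice of helper axis are illustrative assumptions, not the procedure used to produce the figure.

```python
import math
import random


def _cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def sample_vmf_3d(mu, kappa, rng):
    """Draw one unit vector from a VMF distribution on the sphere in 3D.

    mu must be a unit 3-vector and kappa must be positive.
    """
    # Inverse-CDF draw of the cosine w = mu'x (density ~ exp(kappa * w)).
    u = rng.random()
    w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa

    # Orthonormal basis (e1, e2) of the plane perpendicular to mu.
    axis = (1.0, 0.0, 0.0) if abs(mu[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = _cross(mu, axis)
    norm = math.sqrt(sum(c * c for c in e1))
    e1 = tuple(c / norm for c in e1)
    e2 = _cross(mu, e1)

    # Uniform rotation angle around the mean direction.
    phi = 2.0 * math.pi * rng.random()
    s = math.sqrt(max(0.0, 1.0 - w * w))
    return tuple(
        w * m + s * (math.cos(phi) * a + math.sin(phi) * b)
        for m, a, b in zip(mu, e1, e2)
    )


rng = random.Random(0)
points = [sample_vmf_3d((0.0, 0.0, 1.0), 10.0, rng) for _ in range(1000)]
mean_z = sum(p[2] for p in points) / len(points)
print(mean_z)  # close to coth(kappa) - 1/kappa for this mean direction
```

Larger values of $\kappa$ concentrate the samples more tightly around the mean direction, which is the visual effect the three colours in the figure illustrate.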
2 Preliminaries


2.1 Definitions 

We will use $\log(z)$ to denote the natural logarithm of $z$ throughout this article.
Before continuing it will be useful to define the Gamma function $\Gamma(z)$,

$$ \Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt, \quad z \in \mathbb{C},\ \operatorname{Re}(z) > 0 \tag{1} $$

$$ \Gamma(z) = (z-1)!, \quad z \in \mathbb{Z}^+ \tag{2} $$

and its relation, the (upper) incomplete Gamma function $\Gamma(s, x)$,

$$ \Gamma(s, x) = (s-1)! \, e^{-x} \sum_{m=0}^{s-1} \frac{x^m}{m!}, \quad s \in \mathbb{Z}^+ \tag{3} $$
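
The finite-sum form of the incomplete Gamma function for integer first argument is easy to check numerically. The sketch below compares it with a plain midpoint-rule quadrature of $\int_x^\infty t^{s-1} e^{-t} \, dt$; the function names, the truncation point and the step count are illustrative assumptions.

```python
import math


def upper_gamma_closed(s, x):
    """(s-1)! * exp(-x) * sum_{m=0}^{s-1} x^m / m!, for a positive integer s."""
    partial_sum = sum(x ** m / math.factorial(m) for m in range(s))
    return math.factorial(s - 1) * math.exp(-x) * partial_sum


def upper_gamma_quad(s, x, upper=60.0, n=20000):
    """Midpoint-rule estimate of the integral of t^(s-1) e^(-t) over [x, upper].

    The tail beyond `upper` is ignored, so x should be well below `upper`.
    """
    h = (upper - x) / n
    total = 0.0
    for i in range(n):
        t = x + (i + 0.5) * h
        total += t ** (s - 1) * math.exp(-t)
    return total * h


closed = upper_gamma_closed(4, 1.5)
numeric = upper_gamma_quad(4, 1.5)
print(closed, numeric)
```

Setting $x = 0$ in the closed form recovers $\Gamma(s) = (s-1)!$.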

and the Modified Bessel Function of the First Kind $I_\alpha(z)$,

$$ I_\alpha(z) = \sum_{m=0}^{\infty} \frac{(z/2)^{2m+\alpha}}{m! \, \Gamma(m + \alpha + 1)}, \tag{4} $$


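which also has the following integral representations [1]

\begin{align*}
I_\alpha(z) &= \frac{(z/2)^{\alpha}}{\sqrt{\pi} \, \Gamma(\alpha + 1/2)} \int_0^{\pi} e^{\pm z \cos\theta} \sin^{2\alpha}\theta \, d\theta, \quad (\alpha \in \mathbb{R}) \tag{5} \\
&= \frac{(z/2)^{\alpha}}{\sqrt{\pi} \, \Gamma(\alpha + 1/2)} \int_{-1}^{1} (1 - t^2)^{(\alpha - 1/2)} e^{\pm z t} \, dt. \quad (\alpha \in \mathbb{R},\ \alpha > -0.5) \tag{6}
\end{align*}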


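Also of interest is the logarithm of this quantity (using the second integral
definition (6)),

\begin{align*}
\log\left(I_\alpha(z)\right) &= \log\left[ \frac{(z/2)^{\alpha}}{\sqrt{\pi} \, \Gamma(\alpha + 1/2)} \int_{-1}^{1} (1 - t^2)^{(\alpha - 1/2)} e^{\pm z t} \, dt \right] \\
&= \log\left(\frac{z}{2}\right)^{\alpha} + \log\frac{1}{\sqrt{\pi} \, \Gamma(\alpha + 1/2)} + \log \int_{-1}^{1} (1 - t^2)^{(\alpha - 1/2)} e^{\pm z t} \, dt \\
&= \alpha \log\left(\frac{z}{2}\right) - \log\left(\sqrt{\pi} \, \Gamma(\alpha + 1/2)\right) + \log \int_{-1}^{1} (1 - t^2)^{(\alpha - 1/2)} e^{\pm z t} \, dt \tag{7}
\end{align*}

Note that the second term does not depend on $z$.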
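
The agreement between the power series (4), the integral representation (6) and the three-term decomposition (7) can be checked numerically. The sketch below is one way to do so with a simple midpoint rule; the function names, the number of series terms and the step count are illustrative assumptions, and the positive sign is taken in the exponent.

```python
import math


def bessel_i_series(alpha, z, terms=60):
    """Modified Bessel function I_alpha(z) from its power series (truncated)."""
    return sum(
        (z / 2.0) ** (2 * m + alpha)
        / (math.factorial(m) * math.gamma(m + alpha + 1.0))
        for m in range(terms)
    )


def bessel_integral(alpha, z, n=20000):
    """Midpoint-rule value of the integral of (1 - t^2)^(alpha - 1/2) e^(z t) on [-1, 1]."""
    h = 2.0 / n
    total = 0.0
    for i in range(n):
        t = -1.0 + (i + 0.5) * h
        total += (1.0 - t * t) ** (alpha - 0.5) * math.exp(z * t)
    return total * h


def log_bessel_terms(alpha, z):
    """The three terms of log I_alpha(z): only the first and third depend on z."""
    return (
        alpha * math.log(z / 2.0),
        -math.log(math.sqrt(math.pi) * math.gamma(alpha + 0.5)),
        math.log(bessel_integral(alpha, z)),
    )


alpha, z = 1.5, 2.0
print(math.log(bessel_i_series(alpha, z)), sum(log_bessel_terms(alpha, z)))
```

Half-integer orders such as $\alpha = 3/2$ (which corresponds to $d = 5$) also have elementary closed forms, which gives a second independent check on the series.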

The Exponential Integral function $E_a(z)$ is given by,

$$ E_a(z) = \int_1^\infty \frac{e^{-zt}}{t^a} \, dt = z^{a-1} \, \Gamma(1 - a, z). \tag{8} $$

An identity that will be useful is,

$$ \int (1 - t)^d e^{t\kappa} \, dt = -2^{d-1} E_{-d}(2\kappa) \, e^{\kappa}. \quad d > 0 \tag{9} $$
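
For a negative integer order $a = -n$, the relation between the Exponential Integral and the incomplete Gamma function reduces to a finite sum, which makes it straightforward to test. The sketch below compares $z^{-(n+1)} \, \Gamma(n+1, z)$ with a midpoint-rule quadrature of $\int_1^\infty t^n e^{-zt} \, dt$; the function names and the truncation rule are illustrative assumptions.

```python
import math


def exp_integral_closed(n, z):
    """E_{-n}(z) = z^(-(n+1)) * Gamma(n + 1, z) for an integer n >= 0."""
    gamma_upper = (
        math.factorial(n)
        * math.exp(-z)
        * sum(z ** m / math.factorial(m) for m in range(n + 1))
    )
    return z ** (-(n + 1)) * gamma_upper


def exp_integral_quad(n, z, steps=50000):
    """Midpoint-rule estimate of the integral of t^n e^(-z t) over [1, inf).

    The integral is truncated at 1 + 60/z, where e^(-z t) is negligible.
    """
    upper = 1.0 + 60.0 / z
    h = (upper - 1.0) / steps
    total = 0.0
    for i in range(steps):
        t = 1.0 + (i + 0.5) * h
        total += t ** n * math.exp(-z * t)
    return total * h


print(exp_integral_closed(2, 1.5), exp_integral_quad(2, 1.5))
```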
2.2 The von Mises-Fisher (VMF) distribution

The probability density function (PDF) of the VMF distribution for a random
$d$-dimensional unit vector $\mathbf{x}$ ($\|\mathbf{x}\|_2 = 1$) is given by:

$$ M_d(\mu, \kappa) = c_d(\kappa) \, e^{\kappa \mu' \mathbf{x}}, \quad \mathbf{x} \in S^{d-1} \tag{10} $$

where the normalisation constant $c_d(\kappa)$ is given by,

$$ c_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2} \, I_{d/2 - 1}(\kappa)}. \tag{11} $$
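
As an illustration of the density and its normalisation constant, the sketch below evaluates $\log c_d(\kappa)$ and the log-density using a truncated power series for the Bessel function. The function names and the truncation length are illustrative assumptions, and the series is only adequate for moderate $\kappa$ (a production implementation would use a scaled Bessel routine).

```python
import math


def bessel_i(alpha, z, terms=60):
    """Modified Bessel function of the first kind via its power series (truncated)."""
    return sum(
        (z / 2.0) ** (2 * m + alpha)
        / (math.factorial(m) * math.gamma(m + alpha + 1.0))
        for m in range(terms)
    )


def log_vmf_normalizer(d, kappa):
    """log c_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2 pi) - log I_{d/2-1}(kappa)."""
    alpha = d / 2.0 - 1.0
    return (
        alpha * math.log(kappa)
        - (d / 2.0) * math.log(2.0 * math.pi)
        - math.log(bessel_i(alpha, kappa))
    )


def vmf_log_pdf(x, mu, kappa):
    """Log-density of a unit vector x under a VMF with mean direction mu."""
    d = len(x)
    dot = sum(a * b for a, b in zip(mu, x))
    return log_vmf_normalizer(d, kappa) + kappa * dot


mu = (0.0, 0.0, 1.0)
print(vmf_log_pdf(mu, mu, 2.0))
```

For $d = 3$ the constant has the elementary form $\kappa / (4\pi \sinh\kappa)$, which is a convenient reference value when testing such code.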


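The (non-symmetric) Kullback-Leibler (KL) divergence from one probability
distribution $q(\mathbf{x})$ to another probability distribution $p(\mathbf{x})$ is defined as,

\begin{align*}
\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x})) &= \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \, d\mathbf{x} \tag{12} \\
&= \mathbb{E}_{q}\left[ \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \right]. \tag{13}
\end{align*}

Although this is general to any two distributions, we will assume that $p(\mathbf{x})$ is
the “prior” distribution and $q(\mathbf{x})$ is the “posterior” distribution as commonly
used in Bayesian analysis.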
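
The asymmetry of the KL-divergence is easiest to see on a small discrete example, where the integral becomes a sum. The sketch below is only an illustration of the definition, with made-up probability vectors; it is not specific to the VMF distribution.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) = sum_i q_i log(q_i / p_i) for discrete distributions.

    Terms with q_i = 0 contribute zero; p_i must be positive wherever q_i is.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


posterior = [0.5, 0.5]  # plays the role of q
prior = [0.9, 0.1]      # plays the role of p

print(kl_divergence(posterior, prior))  # KL(q || p)
print(kl_divergence(prior, posterior))  # KL(p || q), a different value
```

Swapping the arguments changes the result, which is why the roles of prior and posterior have to be fixed before the derivation starts.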


3 KL-Divergence for the VMF Distribution 


3.1 General Case 


We will assume that we have prior and posterior distributions defined over
vectors $\mathbf{x} \in \mathbb{R}^d$, $\|\mathbf{x}\|_2 = 1$ as follows,

\begin{align*}
p(\mathbf{x}) &\sim M_d(\mu_p, \kappa_p), \\
q(\mathbf{x}) &\sim M_d(\mu_q, \kappa_q). \tag{14}
\end{align*}

We will now derive the KL-divergence for two VMF distributions. The main
problem in doing so will be the normalisation constants $c_d(\kappa_p)$ and $c_d(\kappa_q)$.

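Theorem 3.1 For prior and posterior distributions as defined above over
vectors $\mathbf{x} \in \mathbb{R}^d$, $\|\mathbf{x}\|_2 = 1$, $d < \infty$, $d$ odd$^1$, we have

$$ \mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x})) \le \kappa_q - \kappa_p \mu_p' \mu_q + d^\bullet \log(\kappa_q) + \sum_{m=1}^{d^\circ} \frac{\kappa_q^m}{m!} - \frac{d^2 - 2d + 1}{4} \log(\kappa_p) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \tag{15} $$

where $d^\circ = \frac{d-3}{2}$ and $d^\bullet = \frac{d-1}{2}$.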


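$^1$ For even $d$ we can simply add a “null” dimension.

Proof. From (12), letting $d^* = \frac{d}{2} - 1$, $d^\circ = \frac{d-3}{2}$ and $d^\bullet = \frac{d-1}{2}$, we have,

\begin{align*}
\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x}))
&= \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \, d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \log\left( c_d(\kappa_q) e^{\kappa_q \mu_q' \mathbf{x}} \right) - \log\left( c_d(\kappa_p) e^{\kappa_p \mu_p' \mathbf{x}} \right) \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \log c_d(\kappa_q) - \log c_d(\kappa_p) + \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ d^* \log(\kappa_q) - (d/2) \log(2\pi) - \log I_{d^*}(\kappa_q) \\
&\qquad\qquad - d^* \log(\kappa_p) + (d/2) \log(2\pi) + \log I_{d^*}(\kappa_p) + \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} \Big] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ d^* \log\left(\frac{\kappa_q}{\kappa_p}\right) - \log I_{d^*}(\kappa_q) + \log I_{d^*}(\kappa_p) + \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ d^* \log\left(\frac{\kappa_q}{\kappa_p}\right) + \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} \\
&\qquad\qquad - \log\left(\frac{\kappa_q}{2}\right)^{d^*} + \log\left(\sqrt{\pi} \, \Gamma\left(d^* + \tfrac{1}{2}\right)\right) - \log \int_{-1}^{1} (1 - t^2)^{(d^* - 1/2)} e^{\pm \kappa_q t} \, dt \\
&\qquad\qquad + \log\left(\frac{\kappa_p}{2}\right)^{d^*} - \log\left(\sqrt{\pi} \, \Gamma\left(d^* + \tfrac{1}{2}\right)\right) + \log \int_{-1}^{1} (1 - t^2)^{(d^* - 1/2)} e^{\pm \kappa_p t} \, dt \Big] d\mathbf{x} \tag{16} \\
&\qquad \text{(using (7))} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ d^* \log\left(\frac{\kappa_q}{\kappa_p}\right) + \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} - d^* \log\left(\frac{\kappa_q}{2}\right) + d^* \log\left(\frac{\kappa_p}{2}\right) \\
&\qquad\qquad - \log \int_{-1}^{1} (1 - t^2)^{d^\circ} e^{\pm \kappa_q t} \, dt + \log \int_{-1}^{1} (1 - t^2)^{d^\circ} e^{\pm \kappa_p t} \, dt \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} - \log \int_{-1}^{1} (1 - t^2)^{d^\circ} e^{\pm \kappa_q t} \, dt + \log \int_{-1}^{1} (1 - t^2)^{d^\circ} e^{\pm \kappa_p t} \, dt \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} - \log\left( -2^{d^\circ - 1} E_{-d^\circ}(2\kappa_q) e^{\kappa_q} \right) + \log\left( -2^{d^\circ - 1} E_{-d^\circ}(2\kappa_p) e^{\kappa_p} \right) \Big] d\mathbf{x} \tag{17} \\
&\qquad \text{(using (9))} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} - \kappa_q + \kappa_p - \log\left[ E_{-d^\circ}(2\kappa_q) \right] + \log\left[ E_{-d^\circ}(2\kappa_p) \right] \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q (\mu_q' \mathbf{x} - 1) - \kappa_p (\mu_p' \mathbf{x} - 1) \\
&\qquad\qquad - \log\left( (2\kappa_q)^{-d^\bullet} \, \Gamma(d^\bullet, 2\kappa_q) \right) + \log\left( (2\kappa_p)^{-d^\bullet} \, \Gamma(d^\bullet, 2\kappa_p) \right) \Big] d\mathbf{x} \tag{18} \\
&\qquad \text{(using the definition of the Exponential Integral function (8))}
\end{align*}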

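\begin{align*}
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q (\mu_q' \mathbf{x} - 1) - \kappa_p (\mu_p' \mathbf{x} - 1) + d^\bullet \log(2\kappa_q) - d^\bullet \log(2\kappa_p) \\
&\qquad\qquad - \log\left( \Gamma(d^\bullet, 2\kappa_q) \right) + \log\left( \Gamma(d^\bullet, 2\kappa_p) \right) \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q (\mu_q' \mathbf{x} - 1) - \kappa_p (\mu_p' \mathbf{x} - 1) + d^\bullet \log(2\kappa_q) - d^\bullet \log(2\kappa_p) \\
&\qquad\qquad - \log\left( d^\circ! \, e^{-\kappa_q} \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) + \log\left( d^\circ! \, e^{-\kappa_p} \sum_{m=0}^{d^\circ} \frac{\kappa_p^m}{m!} \right) \Big] d\mathbf{x} \tag{19} \\
&\qquad \text{(using (3) and that } d^\bullet - 1 = d^\circ \text{)}
\end{align*}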

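\begin{align*}
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q (\mu_q' \mathbf{x} - 1) - \kappa_p (\mu_p' \mathbf{x} - 1) + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) + \kappa_q - \kappa_p \\
&\qquad\qquad - \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_p^m}{m!} \right) \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) \\
&\qquad\qquad - \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_p^m}{m!} \right) \Big] d\mathbf{x}
\end{align*}

Further simplifications:

\begin{align*}
&\le \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) \\
&\qquad\qquad - \sum_{m=0}^{d^\circ} \log\left( \frac{\kappa_p^m}{m!} \right) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \Big] d\mathbf{x} \tag{20} \\
&\qquad \text{(by Jensen's inequality)} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) \\
&\qquad\qquad + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) - \sum_{m=0}^{d^\circ} \left( m \log(\kappa_p) - \log m! \right) \Big] d\mathbf{x} \\
&\le \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) \\
&\qquad\qquad + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) - \sum_{m=1}^{d^\circ} \left( m \log(\kappa_p) - m \log m + m - 1 \right) \Big] d\mathbf{x} \tag{21} \\
&\qquad \text{(using } n \log\tfrac{n}{e} + 1 \le \log n! \le (n + 1) \log\tfrac{n}{e} + 1 \text{)} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \\
&\qquad\qquad - \sum_{m=1}^{d^\circ} \left( m \log(\kappa_p) - m \log m \right) - d^\circ (d^\circ + 1) + (d^\circ + 1) \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \\
&\qquad\qquad - \sum_{m=1}^{d^\circ} \left( m \log(\kappa_p) - m \log m \right) - d^{\circ 2} + 1 \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \\
&\qquad\qquad - \sum_{m=1}^{d^\circ} \left( m \log(\kappa_p) \right) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \Big] d\mathbf{x} \tag{22}
\end{align*}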
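\begin{align*}
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \\
&\qquad\qquad - d^\circ (d^\circ + 1) \log(\kappa_p) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) - d^\bullet \log(\kappa_p) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \\
&\qquad\qquad - \left( \frac{(d-3)^2}{4} + \frac{d-3}{2} \right) \log(\kappa_p) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) + \log\left( \sum_{m=0}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \\
&\qquad\qquad - \frac{d^2 - 2d + 1}{4} \log(\kappa_p) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \Big] d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) + \log\left( 1 + \sum_{m=1}^{d^\circ} \frac{\kappa_q^m}{m!} \right) \\
&\qquad\qquad - \frac{d^2 - 2d + 1}{4} \log(\kappa_p) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \Big] d\mathbf{x} \\
&\le \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} - \kappa_p \mu_p' \mathbf{x} + d^\bullet \log(\kappa_q) + \sum_{m=1}^{d^\circ} \frac{\kappa_q^m}{m!} \\
&\qquad\qquad - \frac{d^2 - 2d + 1}{4} \log(\kappa_p) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \Big] d\mathbf{x} \tag{23} \\
&\qquad \text{(using } n \ge \log(1 + n) \ge \tfrac{n}{1+n},\ (n > -1) \text{)} \\
&= \kappa_q - \kappa_p \mu_p' \mu_q + d^\bullet \log(\kappa_q) + \sum_{m=1}^{d^\circ} \frac{\kappa_q^m}{m!} - \frac{d^2 - 2d + 1}{4} \log(\kappa_p) + d^\circ (d^\circ + 1) \log d^\circ - d^{\circ 2} + 1 \tag{24} \\
&\qquad \text{(as } \textstyle\int_{\mathbf{x}} q(\mathbf{x}) \, d\mathbf{x} = 1 \text{, and } \mathbb{E}[\mathbf{x}] = \mu_q \text{, and } \mu_q' \mu_q = 1 \text{)}
\end{align*}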
The term $\mu_q' \mu_p$ can be seen as the cosine similarity between the prior and
posterior mean vectors. For $0 < \kappa_q < 1$, the term $\sum_{m=1}^{d^\circ} \frac{\kappa_q^m}{m!} \approx \kappa_q$. However, for
large $\kappa_q$ and large $d$ this term can grow very large.
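
To get a feel for how the right-hand side of (15) behaves, it can be evaluated directly. The sketch below is a plain transcription of the bound under stated assumptions: $d$ is odd with $d \ge 3$, the quantities $(d-3)/2$ and $(d-1)/2$ are read off from the derivation, the coefficient of $\log \kappa_p$ is $(d^2 - 2d + 1)/4$, and the term $d^\circ (d^\circ + 1) \log d^\circ$ is taken to be zero when $d^\circ = 0$. The function name and argument names are hypothetical.

```python
import math


def vmf_kl_upper_bound(kappa_q, kappa_p, cos_angle, d):
    """Evaluate the right-hand side of the bound (15).

    cos_angle is the inner product of the prior and posterior mean directions.
    Assumes odd d >= 3 and positive concentrations.
    """
    if d < 3 or d % 2 == 0:
        raise ValueError("this sketch assumes an odd dimension d >= 3")
    d_circ = (d - 3) // 2    # upper limit of the sum
    d_bullet = (d - 1) / 2   # coefficient of log(kappa_q)

    # Truncated exponential series: sum_{m=1}^{d_circ} kappa_q^m / m!
    series = sum(kappa_q ** m / math.factorial(m) for m in range(1, d_circ + 1))

    # Assumption: the d_circ * (d_circ + 1) * log(d_circ) term vanishes at d_circ = 0.
    log_term = d_circ * (d_circ + 1) * math.log(d_circ) if d_circ > 0 else 0.0

    return (
        kappa_q
        - kappa_p * cos_angle
        + d_bullet * math.log(kappa_q)
        + series
        - (d * d - 2 * d + 1) / 4 * math.log(kappa_p)
        + log_term
        - d_circ ** 2
        + 1
    )


for dim in (3, 5, 7, 9):
    print(dim, vmf_kl_upper_bound(5.0, 1.0, 0.0, dim))
```

Evaluating the function for increasing $d$ at a fixed, moderately large $\kappa_q$ shows the truncated exponential series starting to dominate the other terms.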


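Special case: uniform prior

The VMF distribution is defined on the hypersphere $S^{d-1}$, which is a
special case of a Stiefel manifold: the manifold of $r$ orthonormal vectors in
$\mathbb{R}^d$ with $r = 1$. The Stiefel manifold has finite area,

$$ \operatorname{Area}(d, r) = \frac{2^r \pi^{dr/2}}{\pi^{r(r-1)/4} \prod_{j=1}^{r} \Gamma\left( \frac{d - j + 1}{2} \right)} \tag{25} $$

and so,

$$ \operatorname{Area}(d, 1) = \frac{2 \pi^{d/2}}{\Gamma\left( \frac{d}{2} \right)}. \tag{26} $$

For the special case of the uniform prior (more precisely $\lim_{\kappa \to 0}$) the prior
PDF reduces to,

$$ M_d(\mu, \kappa) = c_d(0) \, e^{0} = \frac{\Gamma\left( \frac{d}{2} \right)}{2 \pi^{d/2}}, \tag{27} $$

which is simply one over the area of the manifold. This leads to a simpler form
for the KL-divergence.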
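
The limit behind the uniform prior can be checked numerically: as $\kappa \to 0$ the normalisation constant $c_d(\kappa)$ should approach $\Gamma(d/2) / (2\pi^{d/2})$, the reciprocal of the area of the unit hypersphere. The sketch below uses a truncated Bessel power series; the function names and the truncation length are illustrative assumptions.

```python
import math


def bessel_i(alpha, z, terms=60):
    """Modified Bessel function of the first kind via its power series (truncated)."""
    return sum(
        (z / 2.0) ** (2 * m + alpha)
        / (math.factorial(m) * math.gamma(m + alpha + 1.0))
        for m in range(terms)
    )


def vmf_normalizer(d, kappa):
    """c_d(kappa) = kappa^(d/2 - 1) / ((2 pi)^(d/2) * I_{d/2 - 1}(kappa))."""
    alpha = d / 2.0 - 1.0
    return kappa ** alpha / ((2.0 * math.pi) ** (d / 2.0) * bessel_i(alpha, kappa))


def uniform_density(d):
    """Density of the uniform distribution on the unit hypersphere in R^d."""
    return math.gamma(d / 2.0) / (2.0 * math.pi ** (d / 2.0))


for kappa in (1.0, 0.1, 0.001):
    print(kappa, vmf_normalizer(5, kappa), uniform_density(5))
```

For $d = 3$ the uniform density is $1 / (4\pi)$, the reciprocal of the surface area of the ordinary unit sphere.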


Corollary 3.2 For prior and posterior distributions as defined above over
vectors $\mathbf{x} \in \mathbb{R}^d$, $\|\mathbf{x}\|_2 = 1$, $d < \infty$, we have

$$ \mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x})) = \kappa_q - d^* \log 2 \tag{28} $$


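Proof.

\begin{align*}
\mathrm{KL}(q(\mathbf{x}) \,\|\, p(\mathbf{x}))
&= \int_{\mathbf{x}} q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} \, d\mathbf{x} \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \log\left( c_d(\kappa_q) e^{\kappa_q \mu_q' \mathbf{x}} \right) - \log c_d(0) \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \kappa_q \mu_q' \mathbf{x} + \log c_d(\kappa_q) - \log \Gamma\left( \tfrac{d}{2} \right) + \log\left( 2 \pi^{d/2} \right) \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \kappa_q \mu_q' \mathbf{x} + \log c_d(\kappa_q) - \log (d^*)! + (d/2) \log(2\pi) \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \Big[ \kappa_q \mu_q' \mathbf{x} + d^* \log(\kappa_q) - (d/2) \log(2\pi) \\
&\qquad\qquad - \log I_{d^*}(\kappa_q) - \log (d^*)! + (d/2) \log(2\pi) \Big] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \kappa_q \mu_q' \mathbf{x} + d^* \log(\kappa_q) - \log I_{d^*}(\kappa_q) - \log (d^*)! \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \kappa_q \mu_q' \mathbf{x} + d^* \log(\kappa_q) - \log\left( \frac{\kappa_q}{2} \right)^{d^*} + \log \Gamma(d^* + 1) - \log (d^*)! \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \kappa_q \mu_q' \mathbf{x} + d^* \log(\kappa_q) - d^* \log\left( \frac{\kappa_q}{2} \right) \right] d\mathbf{x}, \\
&= \int_{\mathbf{x}} q(\mathbf{x}) \left[ \kappa_q \mu_q' \mathbf{x} - d^* \log 2 \right] d\mathbf{x}, \\
&= \kappa_q - d^* \log 2. \tag{29}
\end{align*}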
For this special case, it can be seen that the dependence on the dimension
is much more benign. This could prove useful for further computation (e.g. if
the KL-divergence were to be used in a probably approximately correct (PAC)-Bayes
bound [?]).

4 Conclusions 

We have presented a derivation of the Kullback-Leibler (KL) divergence for the
von Mises-Fisher (VMF) distribution, including the special case of a uniform
prior over the hypersphere.